36 research outputs found

    Context-aware and Scale-insensitive Temporal Repetition Counting

    Full text link
    Temporal repetition counting aims to estimate the number of cycles of a given repetitive action. Existing deep learning methods assume repetitive actions are performed at a fixed time-scale, which does not hold for the complex repetitive actions found in real life. In this paper, we tailor a context-aware and scale-insensitive framework to tackle the challenges in repetition counting caused by unknown and diverse cycle lengths. Our approach combines two key insights: (1) cycle lengths of different actions are unpredictable and would require large-scale searching, but once a coarse cycle length is determined, the variation between repetitions can be handled by regression; (2) determining the cycle length cannot rely only on a short fragment of video but requires contextual understanding. The first insight is implemented by a coarse-to-fine cycle refinement method. It avoids the heavy computation of exhaustively searching all possible cycle lengths in the video and instead propagates the coarse prediction for further refinement in a hierarchical manner. Second, we propose a bidirectional cycle length estimation method for context-aware prediction. It is a regression network that takes two consecutive coarse cycles as input and predicts the locations of the previous and next repetitive cycles. To benefit training and evaluation in temporal repetition counting, we construct a new and, to date, the largest benchmark, which contains 526 videos with diverse repetitive actions. Extensive experiments show that the proposed network, trained on a single dataset, outperforms state-of-the-art methods on several benchmarks, indicating that the proposed framework is general enough to capture repetition patterns across domains. Comment: Accepted by CVPR 2020
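    The coarse-to-fine, bidirectional regression idea lends itself to a compact illustration. The sketch below is only a hypothetical simplification of that scheme, not the authors' code (names such as CycleRegressor and refine_cycle_length are assumptions): a small regression head looks at two consecutive coarse cycles and corrects the cycle-length estimate instead of exhaustively searching all lengths.

        # Hedged sketch of bidirectional cycle-length regression (illustrative only).
        import torch
        import torch.nn as nn

        class CycleRegressor(nn.Module):
            """Takes pooled features of two consecutive coarse cycles and regresses
            offsets for the previous and next cycle boundaries."""
            def __init__(self, feat_dim=512):
                super().__init__()
                self.head = nn.Sequential(
                    nn.Linear(2 * feat_dim, 256), nn.ReLU(),
                    nn.Linear(256, 2),  # offset for previous cycle, offset for next cycle
                )

            def forward(self, cycle_a, cycle_b):
                # cycle_a, cycle_b: (batch, feat_dim) features of two coarse cycles
                return self.head(torch.cat([cycle_a, cycle_b], dim=1))

        def refine_cycle_length(features, coarse_len, regressor, num_levels=3):
            """Coarse-to-fine refinement: start from a coarse cycle length and
            repeatedly correct it with the bidirectional regressor rather than
            searching every possible length."""
            length = float(coarse_len)
            t = features.size(0)
            for _ in range(num_levels):
                mid = t // 2
                a = features[max(0, mid - int(length)):mid].mean(dim=0, keepdim=True)
                b = features[mid:min(t, mid + int(length))].mean(dim=0, keepdim=True)
                prev_off, next_off = regressor(a, b)[0]
                # average the two context-aware estimates to update the cycle length
                length = 0.5 * ((length + prev_off.item()) + (length + next_off.item()))
            return length

        # usage sketch on random frame features (200 frames, 512-d each)
        feats = torch.randn(200, 512)
        reg = CycleRegressor(feat_dim=512)
        print(refine_cycle_length(feats, coarse_len=16, regressor=reg))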

    SINet: A Scale-insensitive Convolutional Neural Network for Fast Vehicle Detection

    Full text link
    Vision-based vehicle detection approaches have achieved remarkable success in recent years with the development of deep convolutional neural networks (CNNs). However, existing CNN-based algorithms suffer from the fact that convolutional features are scale-sensitive for object detection, while traffic images and videos commonly contain vehicles with a large variance of scales. In this paper, we delve into the source of scale sensitivity and reveal two key issues: 1) existing RoI pooling destroys the structure of small-scale objects; 2) the large intra-class distance caused by a large variance of scales exceeds the representation capability of a single network. Based on these findings, we present a scale-insensitive convolutional neural network (SINet) for fast detection of vehicles with a large variance of scales. First, we present a context-aware RoI pooling that maintains the contextual information and original structure of small-scale objects. Second, we present a multi-branch decision network to minimize the intra-class distance of features. These lightweight techniques add no extra time complexity yet bring a prominent improvement in detection accuracy. The proposed techniques can be equipped on any deep network architecture and trained end-to-end. Our SINet achieves state-of-the-art performance in terms of accuracy and speed (up to 37 FPS) on the KITTI benchmark and on a new highway dataset, which contains a large variance of scales and extremely small objects. Comment: Accepted by IEEE Transactions on Intelligent Transportation Systems (T-ITS)
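    To make the first idea concrete, here is a hedged sketch of what such context-aware pooling could look like (the function name context_aware_roi_pool and the enlargement ratio are illustrative assumptions, not the paper's implementation): small RoIs are enlarged to keep surrounding context and are upsampled bilinearly rather than shrunk by max pooling, so their spatial structure survives.

        # Hedged sketch of context-aware RoI pooling for small objects (illustrative only).
        import torch
        import torch.nn.functional as F

        def context_aware_roi_pool(feature_map, box, out_size=7, context_ratio=1.5):
            """feature_map: (C, H, W) tensor; box: (x1, y1, x2, y2) in feature coordinates."""
            _, h, w = feature_map.shape
            x1, y1, x2, y2 = box
            # enlarge the box so that surrounding context is kept for small objects
            cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
            bw, bh = (x2 - x1) * context_ratio, (y2 - y1) * context_ratio
            x1, x2 = int(max(0, cx - bw / 2)), int(min(w, cx + bw / 2))
            y1, y2 = int(max(0, cy - bh / 2)), int(min(h, cy + bh / 2))
            roi = feature_map[:, y1:y2, x1:x2].unsqueeze(0)
            if roi.shape[-1] < out_size or roi.shape[-2] < out_size:
                # small RoI: bilinear upsampling preserves its original structure
                pooled = F.interpolate(roi, size=(out_size, out_size),
                                       mode="bilinear", align_corners=False)
            else:
                # large RoI: ordinary adaptive pooling is sufficient
                pooled = F.adaptive_max_pool2d(roi, out_size)
            return pooled.squeeze(0)

        # usage sketch: a 3x4 box pooled from a 256-channel feature map
        fmap = torch.randn(256, 64, 64)
        print(context_aware_roi_pool(fmap, (10, 10, 13, 14)).shape)  # (256, 7, 7)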

    RGB-D Visual Saliency Detection Algorithm Based on Information Guided and Multimodal Feature Fusion

    No full text
    With the development of information technology and the popularization of electronic devices, images and videos have become important carriers of information in daily life, and efficiently mining valuable content from such massive data has become an important topic in computer vision research. Salient object detection, which models human visual attention, is increasingly applied in this kind of processing. However, current RGB-D (color-depth) models do not mine the associations among depth cues sufficiently, and there is still considerable room for improving the quality of the resulting saliency maps. To address this, an improved RGB-D saliency detection model based on information guidance and multimodal feature fusion is proposed: an absorbing Markov model is introduced to optimize the guidance of low-, middle-, and high-level saliency maps and to capture different levels of feature information. The network is then guided progressively through feature encoding, multi-scale and multi-attention modules, and an attention refinement mechanism. Experimental analysis shows that the proposed fusion model improves average classification accuracy by 5.23% with an error value below 0.1, and its effectiveness on all four quantitative indicators exceeds 92%. The detection response rate exceeds 93%, although accuracy decreases for certain target objects. The algorithm can serve as a reference for target localization and recognition and for virtual scene detection.
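    The absorbing Markov component refers to a well-known construction in saliency detection: image regions are treated as states of a Markov chain, boundary regions act as absorbing states, and the expected number of steps before absorption scores saliency. The snippet below is a minimal sketch of that general construction, not this paper's exact model (the function name and the Gaussian affinity are assumptions).

        # Minimal sketch of saliency via an absorbing Markov chain (general technique).
        import numpy as np

        def absorbing_markov_saliency(features, absorbing_mask, sigma=0.1):
            """features: (n, d) region descriptors; absorbing_mask: (n,) bool array
            marking absorbing (e.g. image-boundary) regions."""
            n = features.shape[0]
            # pairwise affinity between regions -> row-normalised transition matrix
            d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
            affinity = np.exp(-d2 / (2 * sigma ** 2))
            np.fill_diagonal(affinity, 0.0)
            P = affinity / affinity.sum(axis=1, keepdims=True)

            transient = ~absorbing_mask
            Q = P[np.ix_(transient, transient)]        # transient-to-transient block
            N = np.linalg.inv(np.eye(Q.shape[0]) - Q)  # fundamental matrix
            absorbed_time = N.sum(axis=1)              # expected steps before absorption

            saliency = np.zeros(n)
            saliency[transient] = absorbed_time
            return saliency / (saliency.max() + 1e-8)

        # usage sketch: 50 regions described by mean colour, first 10 on the image border
        feats = np.random.rand(50, 3)
        border = np.zeros(50, dtype=bool)
        border[:10] = True
        print(absorbing_markov_saliency(feats, border).shape)  # (50,)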

    Holistically associated transductive zero-shot learning

    No full text

    GDFace: Gated deformation for multi-view face image synthesis

    No full text
    Photorealistic multi-view face synthesis from a single image is an important but challenging problem. Existing methods mainly learn a texture mapping model from the source face to the target face. However, they fail to consider the internal deformation caused by pose changes, leading to unsatisfactory synthesis results under large pose variations. In this paper, we propose a Gated Deformable Face Synthesis Network that models the deformation of faces to aid the synthesis of the target face image. Specifically, we propose a dual network consisting of two modules. The first module estimates the deformation between the two views, in the form of convolution offsets, according to the input and target poses. The second module leverages the predicted deformation offsets to generate the target face image. In this way, pose changes are explicitly modeled in the face generator to cope with geometric transformation, by adaptively focusing on pertinent regions of the source image. To compensate for offset estimation errors, we introduce a soft-gating mechanism that enables adaptive fusion between deformable features and primitive features. Extensive experimental results on five widely used benchmarks show that our approach performs favorably against the state of the art in multi-view face synthesis, especially for large pose changes.
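    The soft-gating idea invites a small illustration. Below is a hedged sketch of such a gated fusion (the SoftGatedFusion module and its layer sizes are assumptions, not the released model): a learned gate decides, per spatial location, how much to trust the deformation-warped features versus the primitive, un-warped ones.

        # Hedged sketch of soft-gated fusion of deformable and primitive features.
        import torch
        import torch.nn as nn

        class SoftGatedFusion(nn.Module):
            def __init__(self, channels):
                super().__init__()
                # gate predicted from both feature streams, squashed to (0, 1)
                self.gate = nn.Sequential(
                    nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                    nn.Sigmoid(),
                )

            def forward(self, deformable_feat, primitive_feat):
                g = self.gate(torch.cat([deformable_feat, primitive_feat], dim=1))
                # where offset estimation is unreliable, g -> 0 and the primitive
                # features dominate; otherwise the deformable features are used
                return g * deformable_feat + (1 - g) * primitive_feat

        # usage sketch
        fusion = SoftGatedFusion(channels=64)
        deform = torch.randn(1, 64, 32, 32)  # features sampled with predicted offsets
        prim = torch.randn(1, 64, 32, 32)    # features without deformation
        fused = fusion(deform, prim)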

    Unsupervised domain adaptation via importance sampling

    No full text